Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement canonical domain update script #348

Merged
merged 7 commits into from
Nov 15, 2024

Conversation

m453h
Copy link
Contributor

@m453h m453h commented Nov 5, 2024

This PR introduces a script for updating incorrect canonical domains for documents stored in Elasticsearch.

The implementation of this script uses:

  • The scroll API for the retrieval of documents in batches with configurable batch size.

  • The Bulk Helpers to perform batch updates across multiple documents simultaneously, significantly reducing the number of requests sent to Elasticsearch.

Example of the usage of the script :

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts  http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='{                                                                                                                        
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "canonical_domain": "mediacloud.org"
                    }
                }
            ]
        }
    }
}'

Addresses #345

====

Update

The following improvements have been done following observations from @philbudne :

  • To address concerns regarding the usage of scroll API the implementation has been updated to use search_after to retrieve documents that should be updated.

  • Support for query_string has been added thus the script can also be called as follows:

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts  http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='canonical_domain:mediacloud.org AND original_url:"http\://mediacloud.org/need_canonical_url"' --type=query_string

Copy link
Contributor

@thepsalmist thepsalmist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we can just standardise logging to using %s as per this comment

@philbudne
Copy link
Contributor

Some thoughts/questions:

  1. Would it be possible/practical to take a query_string instead of DSL? It might make it (too?) easy to use?!
  2. Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for original_url of http://mediacloud.org/need_canonical_domain would eliminate them
  3. The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??

@m453h
Copy link
Contributor Author

m453h commented Nov 7, 2024

Some thoughts/questions:

  1. Would it be possible/practical to take a query_string instead of DSL? It might make it (too?) easy to use?!
  2. Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for original_url of http://mediacloud.org/need_canonical_domain would eliminate them
  3. The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??

Hey @philbudne interesting thoughts and questions:

  1. On adding query string: I believe I can modify the script so that we could support both DSL and query_string, the user would only pass the query_string as:
    original_url:"http://mediacloud.org/need_canonical_url" AND canonical_domain:"blogdomago.com"

    But we would construct the query as:

    "query": {
           "query_string": {
                "query": "original_url:\"http://mediacloud.org/need_canonical_url\" AND canonical_domain:\"mediacloud.org\""
              }
     }
    

With this validation, execution would just remain the same without needing a lot of modification. Let me work on this and push the changes.

  1. Regarding the example: The script accepts query as an argument so we can always change the query as we see fit to match the documents that will need to be updated at the time of executing the script.

  2. Regarding the use of Scroll API: I did see the warning, regarding scroll API, I did a bit of more research before the implementation in most cases I see the issue with scroll API boils down to maintaining the search context on the server which is costly.I went with Scroll API since:

    • The script would be executed once as a maintenance task, meaning we avoid the risk of having too many search contexts open. Based on the documentation, it states: "Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, such as reindexing the contents of one data stream or index into a new data stream or index with a different configuration."
    • Scroll API is provided by the scan helper out of the box which seems to best fit this scenario.

I'm open to making changes and utilize search_after.

Copy link
Contributor

@kilemensi kilemensi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
indexer/scripts/elastic-canonical-domain-update.py Outdated Show resolved Hide resolved
@m453h m453h requested a review from kilemensi November 15, 2024 11:41
@m453h m453h merged commit e3b8c1d into mediacloud:main Nov 15, 2024
@m453h m453h deleted the canonical_domain_update_script branch November 15, 2024 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants